EDA & Data Preprocessing on Google App Store Rating Dataset

1. Import required libraries and read the dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
In [2]:
df=pd.read_csv("Apps_data.csv")
In [3]:
df
Out[3]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10836 Sya9a Maroc - FR FAMILY 4.5 38 53M 5,000+ Free 0 Everyone Education July 25, 2017 1.48 4.1 and up
10837 Fr. Mike Schmitz Audio Teachings FAMILY 5.0 4 3.6M 100+ Free 0 Everyone Education July 6, 2018 1.0 4.1 and up
10838 Parkinson Exercices FR MEDICAL NaN 3 9.5M 1,000+ Free 0 Everyone Medical January 20, 2017 1.0 2.2 and up
10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 4.5 114 Varies with device 1,000+ Free 0 Mature 17+ Books & Reference January 19, 2015 Varies with device Varies with device
10840 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE 4.5 398307 19M 10,000,000+ Free 0 Everyone Lifestyle July 25, 2018 Varies with device Varies with device

10841 rows × 13 columns

2. Check the first few samples, shape, info of the data and try to familiarize yourself with different features.

In [4]:
df.head(10)
Out[4]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
5 Paper flowers instructions ART_AND_DESIGN 4.4 167 5.6M 50,000+ Free 0 Everyone Art & Design March 26, 2017 1.0 2.3 and up
6 Smoke Effect Photo Maker - Smoke Editor ART_AND_DESIGN 3.8 178 19M 50,000+ Free 0 Everyone Art & Design April 26, 2018 1.1 4.0.3 and up
7 Infinite Painter ART_AND_DESIGN 4.1 36815 29M 1,000,000+ Free 0 Everyone Art & Design June 14, 2018 6.1.61.1 4.2 and up
8 Garden Coloring Book ART_AND_DESIGN 4.4 13791 33M 1,000,000+ Free 0 Everyone Art & Design September 20, 2017 2.9.2 3.0 and up
9 Kids Paint Free - Drawing Fun ART_AND_DESIGN 4.7 121 3.1M 10,000+ Free 0 Everyone Art & Design;Creativity July 3, 2018 2.8 4.0.3 and up
In [5]:
df.shape
Out[5]:
(10841, 13)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

3. Check summary statistics of the dataset. List out the columns that need to be worked upon for model building.

In [7]:
df.describe()
Out[7]:
Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000
In [8]:
df.describe(include='object')
Out[8]:
App Category Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
count 10841 10841 10841 10841 10841 10840 10841 10840 10841 10841 10833 10838
unique 9660 34 6002 462 22 3 93 6 120 1378 2832 33
top ROBLOX FAMILY 0 Varies with device 1,000,000+ Free 0 Everyone Tools August 3, 2018 Varies with device 4.1 and up
freq 9 1972 596 1695 1579 10039 10040 8714 842 326 1459 2451
In [9]:
mc=df.columns.drop('Rating')  # every column except the target 'Rating'
print('The columns that need to be worked upon for model building are')
mc
The columns that need to be worked upon for model building are
Out[9]:
Index(['App', 'Category', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
       'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')
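The cell above lists every column other than 'Rating'. As a rough sketch of a more targeted check, the snippet below flags the text columns whose values mostly parse as numbers once symbols such as `,`, `+`, `$`, `M` and `k` are stripped, since those are the ones stored with the wrong dtype. The tiny `sample` frame, the helper name `numeric_like_columns` and the 0.9 threshold are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset, using values seen in the outputs above.
sample = pd.DataFrame({
    "Category": ["ART_AND_DESIGN", "FAMILY", "MEDICAL"],
    "Reviews": ["159", "38", "3"],
    "Size": ["19M", "53M", "9.5M"],
    "Installs": ["10,000+", "5,000+", "1,000+"],
    "Price": ["0", "$4.99", "0"],
})

def numeric_like_columns(frame, threshold=0.9):
    """Return non-numeric columns whose values mostly parse as numbers after stripping symbols."""
    found = []
    for name in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[name]):
            continue  # already numeric, nothing to convert
        stripped = frame[name].astype(str).str.replace(r"[,+$Mk]", "", regex=True)
        parsed = pd.to_numeric(stripped, errors="coerce")
        if parsed.notna().mean() >= threshold:
            found.append(name)
    return found

needs_conversion = numeric_like_columns(sample)
print(needs_conversion)
```

On the real data, columns such as 'Size' also contain 'Varies with device', so the threshold may need to be lowered.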

4. Check whether the dataset contains duplicate records; if so, drop them.

In [10]:
df.duplicated().sum()
Out[10]:
483
In [11]:
df.drop_duplicates(inplace=True)
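As a minimal sketch of what the two cells above do, the toy frame below (hypothetical values, not rows from the dataset) contains one fully repeated record. `duplicated()` marks every row that is identical to an earlier one, and `drop_duplicates()` keeps the first occurrence of each.

```python
import pandas as pd

# Hypothetical toy frame; the repeated ROBLOX row stands in for a duplicate record.
apps = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Infinite Painter"],
    "Reviews": ["100", "100", "36815"],
})

n_duplicates = int(apps.duplicated().sum())               # rows identical to an earlier row
deduped = apps.drop_duplicates().reset_index(drop=True)   # keeps the first occurrence

print(n_duplicates, len(deduped))
```

Resetting the index here is optional; the notebook keeps the original index and resets it later, before concatenation.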

5. Check the unique categories in the column 'Category'. Is there any invalid category? If yes, drop it.

In [12]:
df['Category'].unique()
Out[12]:
array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
       '1.9'], dtype=object)
In [13]:
df[df['Category']=='1.9']
Out[13]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN
In [14]:
df.drop(index=10472,inplace=True)
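The cell above drops the bad row by its hard-coded index. A sketch of a pattern-based alternative follows; it assumes that a valid category consists only of capital letters and underscores, which holds for the 33 valid values listed above. The toy frame is illustrative.

```python
import pandas as pd

# Hypothetical toy frame; '1.9' mimics the malformed row found above.
apps = pd.DataFrame({
    "App": ["Sketch - Draw & Paint", "Life Made WI-Fi Touchscreen Photo Frame", "FR Calculator"],
    "Category": ["ART_AND_DESIGN", "1.9", "FAMILY"],
})

# Assumption: a valid category contains only capital letters and underscores.
is_valid = apps["Category"].str.fullmatch(r"[A-Z_]+")
cleaned = apps[is_valid].copy()

print(cleaned["Category"].tolist())
```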

6. Check whether the column 'Rating' has missing values; if so, drop those rows. Then create a new column 'Rating_category' by converting ratings to high and low categories (> 3.5 is high, the rest low).

In [15]:
df['Rating'].isna().sum()
Out[15]:
1465
In [16]:
df.isnull().sum()
Out[16]:
App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               0
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64
In [17]:
df.dropna(subset='Rating',inplace=True)
In [18]:
df['Rating']
Out[18]:
0        4.1
1        3.9
2        4.7
3        4.5
4        4.3
        ... 
10834    4.0
10836    4.5
10837    5.0
10839    4.5
10840    4.5
Name: Rating, Length: 8892, dtype: float64
In [19]:
df=df[df['Category'].notna()].copy()  # .copy() avoids SettingWithCopyWarning when columns are added later
In [20]:
df
Out[20]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10834 FR Calculator FAMILY 4.0 7 2.6M 500+ Free 0 Everyone Education June 18, 2017 1.0.0 4.1 and up
10836 Sya9a Maroc - FR FAMILY 4.5 38 53M 5,000+ Free 0 Everyone Education July 25, 2017 1.48 4.1 and up
10837 Fr. Mike Schmitz Audio Teachings FAMILY 5.0 4 3.6M 100+ Free 0 Everyone Education July 6, 2018 1.0 4.1 and up
10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 4.5 114 Varies with device 1,000+ Free 0 Mature 17+ Books & Reference January 19, 2015 Varies with device Varies with device
10840 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE 4.5 398307 19M 10,000,000+ Free 0 Everyone Lifestyle July 25, 2018 Varies with device Varies with device

8892 rows × 13 columns

In [21]:
df['Rating_category']=df['Rating'].apply(lambda x:'High' if x > 3.5 else 'Low')
In [22]:
df['Rating_category']
Out[22]:
0        High
1        High
2        High
3        High
4        High
         ... 
10834    High
10836    High
10837    High
10839    High
10840    High
Name: Rating_category, Length: 8892, dtype: object
In [23]:
df['Rating_category'].unique()
Out[23]:
array(['High', 'Low'], dtype=object)
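The same labelling can be written without a row-by-row `apply`. The sketch below uses `np.where` on a few illustrative ratings and shows that the boundary value 3.5 itself lands in 'Low', because the rule is strictly "greater than 3.5".

```python
import numpy as np
import pandas as pd

# Illustrative ratings, including the boundary value 3.5.
ratings = pd.Series([4.1, 3.9, 3.5, 3.6, 1.0], name="Rating")

# Ratings strictly above 3.5 are 'High'; 3.5 itself and anything lower is 'Low'.
rating_category = pd.Series(
    np.where(ratings > 3.5, "High", "Low"),
    index=ratings.index,
    name="Rating_category",
)

print(rating_category.tolist())
```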

7. Check the distribution of the newly created column 'Rating_category' and comment on the distribution.

In [24]:
df['Rating_category'].value_counts()
Out[24]:
High    8012
Low      880
Name: Rating_category, dtype: int64
  • After dropping the rows with missing ratings, the dataset has 8892 rows.
  • About 90% of the apps (8012) fall in the High rating category.
  • About 10% of the apps (880) fall in the Low rating category, so the two classes are heavily imbalanced.
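The percentages in the comment above can be computed directly with `value_counts(normalize=True)`. The sketch below rebuilds the label column from the counts reported in Out[24] rather than from the CSV file, so it runs on its own.

```python
import pandas as pd

# Rebuild the label column from the counts reported above (8012 High, 880 Low).
labels = pd.Series(["High"] * 8012 + ["Low"] * 880, name="Rating_category")

shares = labels.value_counts(normalize=True)       # fraction of rows per class
imbalance_ratio = shares["High"] / shares["Low"]   # how many High rows per Low row

print(shares.round(3))
print(round(imbalance_ratio, 1))
```

A ratio of roughly nine to one is worth keeping in mind when a classifier is later evaluated on this target.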

8. Convert the column 'Reviews' to a numeric data type, check for outliers in the column, and handle them using a transformation approach. (Hint: use a log transformation.)

In [25]:
df['Reviews'].dtype
Out[25]:
dtype('O')
In [26]:
df['Reviews'].unique()
Out[26]:
array(['159', '967', '87510', ..., '603', '1195', '398307'], dtype=object)
In [27]:
df['Reviews']=df['Reviews'].astype('int')
In [28]:
df['Reviews'].dtype
Out[28]:
dtype('int32')
In [29]:
import plotly.express as px
px.box(df['Reviews'])
In [30]:
df['Reviews']=np.log1p(df['Reviews'])
In [31]:
px.box(df['Reviews'])
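The box plots are interactive and do not survive in a text export, so here is a small numeric sketch of the same idea. It counts points outside the usual 1.5 × IQR fences before and after `np.log1p`, using a handful of review counts taken from the rows displayed earlier. The helper name `iqr_outlier_count` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Review counts taken from rows displayed earlier in this notebook.
reviews = pd.Series([159, 967, 87510, 215644, 967, 38, 4, 114, 398307])

def iqr_outlier_count(values):
    """Count points outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

log_reviews = np.log1p(reviews)   # log(1 + x), defined even when x == 0

print(iqr_outlier_count(reviews), iqr_outlier_count(log_reviews))
```

`log1p` is used rather than `log` because the raw column contains apps with 0 reviews, for which a plain logarithm is undefined.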

9. The column 'Size' contains alphanumeric values. Treat the non-numeric data and convert the column into a suitable data type. (Hint: replace M with 1 million and k with 1 thousand, and drop the entries where Size is 'Varies with device'.)

In [32]:
df['Size'].dtype
Out[32]:
dtype('O')
In [33]:
df['Size'].unique()
Out[33]:
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M',
       '11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M',
       '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M',
       '5.7M', '8.6M', '2.4M', '27M', '2.5M', '7.0M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
       '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
       '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
       '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
       '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
       '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
       '3.0M', '7.2M', '5.8M', '3.8M', '9.6M', '45M', '63M', '49M', '77M',
       '4.4M', '70M', '9.3M', '8.1M', '36M', '6.9M', '7.4M', '84M', '97M',
       '2.0M', '1.9M', '1.8M', '5.3M', '47M', '556k', '526k', '76M',
       '7.6M', '59M', '9.7M', '78M', '72M', '43M', '7.7M', '6.3M', '334k',
       '93M', '65M', '79M', '100M', '58M', '50M', '68M', '64M', '34M',
       '67M', '60M', '94M', '9.9M', '232k', '99M', '624k', '95M', '8.5k',
       '41k', '292k', '11k', '80M', '1.7M', '10.0M', '74M', '62M', '69M',
       '75M', '98M', '85M', '82M', '96M', '87M', '71M', '86M', '91M',
       '81M', '92M', '83M', '88M', '704k', '862k', '899k', '378k', '4.8M',
       '266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M', '696k',
       '544k', '525k', '920k', '779k', '853k', '720k', '713k', '772k',
       '318k', '58k', '241k', '196k', '857k', '51k', '953k', '865k',
       '251k', '930k', '540k', '313k', '746k', '203k', '26k', '314k',
       '239k', '371k', '220k', '730k', '756k', '91k', '293k', '17k',
       '74k', '14k', '317k', '78k', '924k', '818k', '81k', '939k', '169k',
       '45k', '965k', '90M', '545k', '61k', '283k', '655k', '714k', '93k',
       '872k', '121k', '322k', '976k', '206k', '954k', '444k', '717k',
       '210k', '609k', '308k', '306k', '175k', '350k', '383k', '454k',
       '1.0M', '70k', '812k', '442k', '842k', '417k', '412k', '459k',
       '478k', '335k', '782k', '721k', '430k', '429k', '192k', '460k',
       '728k', '496k', '816k', '414k', '506k', '887k', '613k', '778k',
       '683k', '592k', '186k', '840k', '647k', '373k', '437k', '598k',
       '716k', '585k', '982k', '219k', '55k', '323k', '691k', '511k',
       '951k', '963k', '25k', '554k', '351k', '27k', '82k', '208k',
       '551k', '29k', '103k', '116k', '153k', '209k', '499k', '173k',
       '597k', '809k', '122k', '411k', '400k', '801k', '787k', '50k',
       '643k', '986k', '516k', '837k', '780k', '20k', '498k', '600k',
       '656k', '221k', '228k', '176k', '34k', '259k', '164k', '458k',
       '629k', '28k', '288k', '775k', '785k', '636k', '916k', '994k',
       '309k', '485k', '914k', '903k', '608k', '500k', '54k', '562k',
       '847k', '948k', '811k', '270k', '48k', '523k', '784k', '280k',
       '24k', '892k', '154k', '18k', '33k', '860k', '364k', '387k',
       '626k', '161k', '879k', '39k', '170k', '141k', '160k', '144k',
       '143k', '190k', '376k', '193k', '473k', '246k', '73k', '253k',
       '957k', '420k', '72k', '404k', '470k', '226k', '240k', '89k',
       '234k', '257k', '861k', '467k', '676k', '552k', '582k', '619k'],
      dtype=object)
In [34]:
df['Size']=df['Size'].replace({'M':'*10**6','k':'*10**3','Varies with device':np.nan},regex=True)
In [35]:
df['Size'].unique()
Out[35]:
array(['19*10**6', '14*10**6', '8.7*10**6', '25*10**6', '2.8*10**6',
       '5.6*10**6', '29*10**6', '33*10**6', '3.1*10**6', '28*10**6',
       '12*10**6', '20*10**6', '21*10**6', '37*10**6', '2.7*10**6',
       '5.5*10**6', '17*10**6', '39*10**6', '31*10**6', '4.2*10**6',
       '23*10**6', '6.0*10**6', '6.1*10**6', '4.6*10**6', '9.2*10**6',
       '5.2*10**6', '11*10**6', '24*10**6', nan, '9.4*10**6', '15*10**6',
       '10*10**6', '1.2*10**6', '26*10**6', '8.0*10**6', '7.9*10**6',
       '56*10**6', '57*10**6', '35*10**6', '54*10**6', '201*10**3',
       '3.6*10**6', '5.7*10**6', '8.6*10**6', '2.4*10**6', '27*10**6',
       '2.5*10**6', '7.0*10**6', '16*10**6', '3.4*10**6', '8.9*10**6',
       '3.9*10**6', '2.9*10**6', '38*10**6', '32*10**6', '5.4*10**6',
       '18*10**6', '1.1*10**6', '2.2*10**6', '4.5*10**6', '9.8*10**6',
       '52*10**6', '9.0*10**6', '6.7*10**6', '30*10**6', '2.6*10**6',
       '7.1*10**6', '22*10**6', '6.4*10**6', '3.2*10**6', '8.2*10**6',
       '4.9*10**6', '9.5*10**6', '5.0*10**6', '5.9*10**6', '13*10**6',
       '73*10**6', '6.8*10**6', '3.5*10**6', '4.0*10**6', '2.3*10**6',
       '2.1*10**6', '42*10**6', '9.1*10**6', '55*10**6', '23*10**3',
       '7.3*10**6', '6.5*10**6', '1.5*10**6', '7.5*10**6', '51*10**6',
       '41*10**6', '48*10**6', '8.5*10**6', '46*10**6', '8.3*10**6',
       '4.3*10**6', '4.7*10**6', '3.3*10**6', '40*10**6', '7.8*10**6',
       '8.8*10**6', '6.6*10**6', '5.1*10**6', '61*10**6', '66*10**6',
       '79*10**3', '8.4*10**6', '3.7*10**6', '118*10**3', '44*10**6',
       '695*10**3', '1.6*10**6', '6.2*10**6', '53*10**6', '1.4*10**6',
       '3.0*10**6', '7.2*10**6', '5.8*10**6', '3.8*10**6', '9.6*10**6',
       '45*10**6', '63*10**6', '49*10**6', '77*10**6', '4.4*10**6',
       '70*10**6', '9.3*10**6', '8.1*10**6', '36*10**6', '6.9*10**6',
       '7.4*10**6', '84*10**6', '97*10**6', '2.0*10**6', '1.9*10**6',
       '1.8*10**6', '5.3*10**6', '47*10**6', '556*10**3', '526*10**3',
       '76*10**6', '7.6*10**6', '59*10**6', '9.7*10**6', '78*10**6',
       '72*10**6', '43*10**6', '7.7*10**6', '6.3*10**6', '334*10**3',
       '93*10**6', '65*10**6', '79*10**6', '100*10**6', '58*10**6',
       '50*10**6', '68*10**6', '64*10**6', '34*10**6', '67*10**6',
       '60*10**6', '94*10**6', '9.9*10**6', '232*10**3', '99*10**6',
       '624*10**3', '95*10**6', '8.5*10**3', '41*10**3', '292*10**3',
       '11*10**3', '80*10**6', '1.7*10**6', '10.0*10**6', '74*10**6',
       '62*10**6', '69*10**6', '75*10**6', '98*10**6', '85*10**6',
       '82*10**6', '96*10**6', '87*10**6', '71*10**6', '86*10**6',
       '91*10**6', '81*10**6', '92*10**6', '83*10**6', '88*10**6',
       '704*10**3', '862*10**3', '899*10**3', '378*10**3', '4.8*10**6',
       '266*10**3', '375*10**3', '1.3*10**6', '975*10**3', '980*10**3',
       '4.1*10**6', '89*10**6', '696*10**3', '544*10**3', '525*10**3',
       '920*10**3', '779*10**3', '853*10**3', '720*10**3', '713*10**3',
       '772*10**3', '318*10**3', '58*10**3', '241*10**3', '196*10**3',
       '857*10**3', '51*10**3', '953*10**3', '865*10**3', '251*10**3',
       '930*10**3', '540*10**3', '313*10**3', '746*10**3', '203*10**3',
       '26*10**3', '314*10**3', '239*10**3', '371*10**3', '220*10**3',
       '730*10**3', '756*10**3', '91*10**3', '293*10**3', '17*10**3',
       '74*10**3', '14*10**3', '317*10**3', '78*10**3', '924*10**3',
       '818*10**3', '81*10**3', '939*10**3', '169*10**3', '45*10**3',
       '965*10**3', '90*10**6', '545*10**3', '61*10**3', '283*10**3',
       '655*10**3', '714*10**3', '93*10**3', '872*10**3', '121*10**3',
       '322*10**3', '976*10**3', '206*10**3', '954*10**3', '444*10**3',
       '717*10**3', '210*10**3', '609*10**3', '308*10**3', '306*10**3',
       '175*10**3', '350*10**3', '383*10**3', '454*10**3', '1.0*10**6',
       '70*10**3', '812*10**3', '442*10**3', '842*10**3', '417*10**3',
       '412*10**3', '459*10**3', '478*10**3', '335*10**3', '782*10**3',
       '721*10**3', '430*10**3', '429*10**3', '192*10**3', '460*10**3',
       '728*10**3', '496*10**3', '816*10**3', '414*10**3', '506*10**3',
       '887*10**3', '613*10**3', '778*10**3', '683*10**3', '592*10**3',
       '186*10**3', '840*10**3', '647*10**3', '373*10**3', '437*10**3',
       '598*10**3', '716*10**3', '585*10**3', '982*10**3', '219*10**3',
       '55*10**3', '323*10**3', '691*10**3', '511*10**3', '951*10**3',
       '963*10**3', '25*10**3', '554*10**3', '351*10**3', '27*10**3',
       '82*10**3', '208*10**3', '551*10**3', '29*10**3', '103*10**3',
       '116*10**3', '153*10**3', '209*10**3', '499*10**3', '173*10**3',
       '597*10**3', '809*10**3', '122*10**3', '411*10**3', '400*10**3',
       '801*10**3', '787*10**3', '50*10**3', '643*10**3', '986*10**3',
       '516*10**3', '837*10**3', '780*10**3', '20*10**3', '498*10**3',
       '600*10**3', '656*10**3', '221*10**3', '228*10**3', '176*10**3',
       '34*10**3', '259*10**3', '164*10**3', '458*10**3', '629*10**3',
       '28*10**3', '288*10**3', '775*10**3', '785*10**3', '636*10**3',
       '916*10**3', '994*10**3', '309*10**3', '485*10**3', '914*10**3',
       '903*10**3', '608*10**3', '500*10**3', '54*10**3', '562*10**3',
       '847*10**3', '948*10**3', '811*10**3', '270*10**3', '48*10**3',
       '523*10**3', '784*10**3', '280*10**3', '24*10**3', '892*10**3',
       '154*10**3', '18*10**3', '33*10**3', '860*10**3', '364*10**3',
       '387*10**3', '626*10**3', '161*10**3', '879*10**3', '39*10**3',
       '170*10**3', '141*10**3', '160*10**3', '144*10**3', '143*10**3',
       '190*10**3', '376*10**3', '193*10**3', '473*10**3', '246*10**3',
       '73*10**3', '253*10**3', '957*10**3', '420*10**3', '72*10**3',
       '404*10**3', '470*10**3', '226*10**3', '240*10**3', '89*10**3',
       '234*10**3', '257*10**3', '861*10**3', '467*10**3', '676*10**3',
       '552*10**3', '582*10**3', '619*10**3'], dtype=object)
In [36]:
# eval() evaluates strings such as '19*10**6'; acceptable here only because the previous cell built them.
df['Size']=df['Size'].dropna().map(eval)
In [37]:
df['Size'].unique()
Out[37]:
array([1.90e+07, 1.40e+07, 8.70e+06, 2.50e+07, 2.80e+06, 5.60e+06,
       2.90e+07, 3.30e+07, 3.10e+06, 2.80e+07, 1.20e+07, 2.00e+07,
       2.10e+07, 3.70e+07, 2.70e+06, 5.50e+06, 1.70e+07, 3.90e+07,
       3.10e+07, 4.20e+06, 2.30e+07, 6.00e+06, 6.10e+06, 4.60e+06,
       9.20e+06, 5.20e+06, 1.10e+07, 2.40e+07,      nan, 9.40e+06,
       1.50e+07, 1.00e+07, 1.20e+06, 2.60e+07, 8.00e+06, 7.90e+06,
       5.60e+07, 5.70e+07, 3.50e+07, 5.40e+07, 2.01e+05, 3.60e+06,
       5.70e+06, 8.60e+06, 2.40e+06, 2.70e+07, 2.50e+06, 7.00e+06,
       1.60e+07, 3.40e+06, 8.90e+06, 3.90e+06, 2.90e+06, 3.80e+07,
       3.20e+07, 5.40e+06, 1.80e+07, 1.10e+06, 2.20e+06, 4.50e+06,
       9.80e+06, 5.20e+07, 9.00e+06, 6.70e+06, 3.00e+07, 2.60e+06,
       7.10e+06, 2.20e+07, 6.40e+06, 3.20e+06, 8.20e+06, 4.90e+06,
       9.50e+06, 5.00e+06, 5.90e+06, 1.30e+07, 7.30e+07, 6.80e+06,
       3.50e+06, 4.00e+06, 2.30e+06, 2.10e+06, 4.20e+07, 9.10e+06,
       5.50e+07, 2.30e+04, 7.30e+06, 6.50e+06, 1.50e+06, 7.50e+06,
       5.10e+07, 4.10e+07, 4.80e+07, 8.50e+06, 4.60e+07, 8.30e+06,
       4.30e+06, 4.70e+06, 3.30e+06, 4.00e+07, 7.80e+06, 8.80e+06,
       6.60e+06, 5.10e+06, 6.10e+07, 6.60e+07, 7.90e+04, 8.40e+06,
       3.70e+06, 1.18e+05, 4.40e+07, 6.95e+05, 1.60e+06, 6.20e+06,
       5.30e+07, 1.40e+06, 3.00e+06, 7.20e+06, 5.80e+06, 3.80e+06,
       9.60e+06, 4.50e+07, 6.30e+07, 4.90e+07, 7.70e+07, 4.40e+06,
       7.00e+07, 9.30e+06, 8.10e+06, 3.60e+07, 6.90e+06, 7.40e+06,
       8.40e+07, 9.70e+07, 2.00e+06, 1.90e+06, 1.80e+06, 5.30e+06,
       4.70e+07, 5.56e+05, 5.26e+05, 7.60e+07, 7.60e+06, 5.90e+07,
       9.70e+06, 7.80e+07, 7.20e+07, 4.30e+07, 7.70e+06, 6.30e+06,
       3.34e+05, 9.30e+07, 6.50e+07, 7.90e+07, 1.00e+08, 5.80e+07,
       5.00e+07, 6.80e+07, 6.40e+07, 3.40e+07, 6.70e+07, 6.00e+07,
       9.40e+07, 9.90e+06, 2.32e+05, 9.90e+07, 6.24e+05, 9.50e+07,
       8.50e+03, 4.10e+04, 2.92e+05, 1.10e+04, 8.00e+07, 1.70e+06,
       7.40e+07, 6.20e+07, 6.90e+07, 7.50e+07, 9.80e+07, 8.50e+07,
       8.20e+07, 9.60e+07, 8.70e+07, 7.10e+07, 8.60e+07, 9.10e+07,
       8.10e+07, 9.20e+07, 8.30e+07, 8.80e+07, 7.04e+05, 8.62e+05,
       8.99e+05, 3.78e+05, 4.80e+06, 2.66e+05, 3.75e+05, 1.30e+06,
       9.75e+05, 9.80e+05, 4.10e+06, 8.90e+07, 6.96e+05, 5.44e+05,
       5.25e+05, 9.20e+05, 7.79e+05, 8.53e+05, 7.20e+05, 7.13e+05,
       7.72e+05, 3.18e+05, 5.80e+04, 2.41e+05, 1.96e+05, 8.57e+05,
       5.10e+04, 9.53e+05, 8.65e+05, 2.51e+05, 9.30e+05, 5.40e+05,
       3.13e+05, 7.46e+05, 2.03e+05, 2.60e+04, 3.14e+05, 2.39e+05,
       3.71e+05, 2.20e+05, 7.30e+05, 7.56e+05, 9.10e+04, 2.93e+05,
       1.70e+04, 7.40e+04, 1.40e+04, 3.17e+05, 7.80e+04, 9.24e+05,
       8.18e+05, 8.10e+04, 9.39e+05, 1.69e+05, 4.50e+04, 9.65e+05,
       9.00e+07, 5.45e+05, 6.10e+04, 2.83e+05, 6.55e+05, 7.14e+05,
       9.30e+04, 8.72e+05, 1.21e+05, 3.22e+05, 9.76e+05, 2.06e+05,
       9.54e+05, 4.44e+05, 7.17e+05, 2.10e+05, 6.09e+05, 3.08e+05,
       3.06e+05, 1.75e+05, 3.50e+05, 3.83e+05, 4.54e+05, 1.00e+06,
       7.00e+04, 8.12e+05, 4.42e+05, 8.42e+05, 4.17e+05, 4.12e+05,
       4.59e+05, 4.78e+05, 3.35e+05, 7.82e+05, 7.21e+05, 4.30e+05,
       4.29e+05, 1.92e+05, 4.60e+05, 7.28e+05, 4.96e+05, 8.16e+05,
       4.14e+05, 5.06e+05, 8.87e+05, 6.13e+05, 7.78e+05, 6.83e+05,
       5.92e+05, 1.86e+05, 8.40e+05, 6.47e+05, 3.73e+05, 4.37e+05,
       5.98e+05, 7.16e+05, 5.85e+05, 9.82e+05, 2.19e+05, 5.50e+04,
       3.23e+05, 6.91e+05, 5.11e+05, 9.51e+05, 9.63e+05, 2.50e+04,
       5.54e+05, 3.51e+05, 2.70e+04, 8.20e+04, 2.08e+05, 5.51e+05,
       2.90e+04, 1.03e+05, 1.16e+05, 1.53e+05, 2.09e+05, 4.99e+05,
       1.73e+05, 5.97e+05, 8.09e+05, 1.22e+05, 4.11e+05, 4.00e+05,
       8.01e+05, 7.87e+05, 5.00e+04, 6.43e+05, 9.86e+05, 5.16e+05,
       8.37e+05, 7.80e+05, 2.00e+04, 4.98e+05, 6.00e+05, 6.56e+05,
       2.21e+05, 2.28e+05, 1.76e+05, 3.40e+04, 2.59e+05, 1.64e+05,
       4.58e+05, 6.29e+05, 2.80e+04, 2.88e+05, 7.75e+05, 7.85e+05,
       6.36e+05, 9.16e+05, 9.94e+05, 3.09e+05, 4.85e+05, 9.14e+05,
       9.03e+05, 6.08e+05, 5.00e+05, 5.40e+04, 5.62e+05, 8.47e+05,
       9.48e+05, 8.11e+05, 2.70e+05, 4.80e+04, 5.23e+05, 7.84e+05,
       2.80e+05, 2.40e+04, 8.92e+05, 1.54e+05, 1.80e+04, 3.30e+04,
       8.60e+05, 3.64e+05, 3.87e+05, 6.26e+05, 1.61e+05, 8.79e+05,
       3.90e+04, 1.70e+05, 1.41e+05, 1.60e+05, 1.44e+05, 1.43e+05,
       1.90e+05, 3.76e+05, 1.93e+05, 4.73e+05, 2.46e+05, 7.30e+04,
       2.53e+05, 9.57e+05, 4.20e+05, 7.20e+04, 4.04e+05, 4.70e+05,
       2.26e+05, 2.40e+05, 8.90e+04, 2.34e+05, 2.57e+05, 8.61e+05,
       4.67e+05, 6.76e+05, 5.52e+05, 5.82e+05, 6.19e+05])
In [38]:
df['Size'].dtype
Out[38]:
dtype('float64')
In [39]:
df['Size'].isnull().sum()
Out[39]:
1468
In [40]:
df.dropna(subset='Size',inplace=True)
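The conversion above builds arithmetic strings and evaluates them with `eval`. A sketch of an alternative that avoids `eval` is shown below: it splits each value into its numeric part and its unit letter, then multiplies. The sample values come from the unique list above; the variable names are illustrative.

```python
import pandas as pd

sizes = pd.Series(["19M", "8.7M", "201k", "8.5k", "Varies with device"])

multipliers = {"M": 10**6, "k": 10**3}

number = pd.to_numeric(sizes.str[:-1], errors="coerce")   # text before the unit letter
factor = sizes.str[-1].map(multipliers)                    # unit letter -> multiplier
size_numeric = number * factor                             # NaN when either part is missing

print(size_numeric.tolist())
```

'Varies with device' ends in a letter that is not in `multipliers`, so it becomes NaN and can be removed with `dropna` exactly as in the cell above.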

10. Check the column 'Installs', treat the unwanted characters and convert the column into a suitable data type.

In [41]:
df['Installs'].unique()
Out[41]:
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000+', '500,000,000+', '100+', '500+', '10+', '1,000,000,000+',
       '5+', '50+', '1+'], dtype=object)
In [42]:
df['Installs']=df['Installs'].str.replace(',','',regex=False)  # assign back instead of in-place on a column
In [43]:
df['Installs']=df['Installs'].str.replace("+","",regex=False)  # '+' is a regex metacharacter, so match it literally
In [44]:
df['Installs'].unique()
Out[44]:
array(['10000', '500000', '5000000', '50000000', '100000', '50000',
       '1000000', '10000000', '5000', '100000000', '1000', '500000000',
       '100', '500', '10', '1000000000', '5', '50', '1'], dtype=object)
In [45]:
df['Installs']=df['Installs'].astype('int')
In [46]:
df['Installs'].dtype
Out[46]:
dtype('int32')
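The two clean-up steps above can also be done in a single pass with a character class. This is only a sketch on a few of the values listed in Out[41]; `int64` is requested explicitly so the result does not depend on the platform's default integer width.

```python
import pandas as pd

installs = pd.Series(["10,000+", "1,000,000,000+", "5+", "500+"])

# One regex pass removes both ',' and '+'; inside [...] the '+' is a literal character.
installs_numeric = installs.str.replace(r"[,+]", "", regex=True).astype("int64")

print(installs_numeric.tolist())
```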
In [47]:
df
Out[47]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Rating_category
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 5.075174 19000000.0 10000 Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up High
1 Coloring book moana ART_AND_DESIGN 3.9 6.875232 14000000.0 500000 Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up High
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 11.379520 8700000.0 5000000 Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up High
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 12.281389 25000000.0 50000000 Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up High
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 6.875232 2800000.0 100000 Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up High
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10833 Chemin (fr) BOOKS_AND_REFERENCE 4.8 3.806662 619000.0 1000 Free 0 Everyone Books & Reference March 23, 2014 0.8 2.2 and up High
10834 FR Calculator FAMILY 4.0 2.079442 2600000.0 500 Free 0 Everyone Education June 18, 2017 1.0.0 4.1 and up High
10836 Sya9a Maroc - FR FAMILY 4.5 3.663562 53000000.0 5000 Free 0 Everyone Education July 25, 2017 1.48 4.1 and up High
10837 Fr. Mike Schmitz Audio Teachings FAMILY 5.0 1.609438 3600000.0 100 Free 0 Everyone Education July 6, 2018 1.0 4.1 and up High
10840 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE 4.5 12.894981 19000000.0 10000000 Free 0 Everyone Lifestyle July 25, 2018 Varies with device Varies with device High

7424 rows × 14 columns

11. Check the column 'Price' , remove the unwanted characters and convert the column into a suitable data type.

In [48]:
df['Price'].unique()
Out[48]:
array(['0', '$4.99', '$6.99', '$7.99', '$3.99', '$5.99', '$2.99', '$1.99',
       '$9.99', '$0.99', '$9.00', '$5.49', '$10.00', '$24.99', '$11.99',
       '$79.99', '$16.99', '$14.99', '$29.99', '$12.99', '$3.49',
       '$10.99', '$7.49', '$1.50', '$19.99', '$15.99', '$33.99', '$39.99',
       '$2.49', '$4.49', '$1.70', '$1.49', '$3.88', '$399.99', '$17.99',
       '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$1.59',
       '$6.49', '$1.29', '$299.99', '$379.99', '$37.99', '$18.99',
       '$389.99', '$8.49', '$1.75', '$14.00', '$2.00', '$3.08', '$2.59',
       '$19.40', '$15.46', '$8.99', '$3.04', '$13.99', '$4.29', '$3.28',
       '$4.60', '$1.00', '$2.90', '$1.97', '$2.56', '$1.20'], dtype=object)
In [49]:
df['Price']=df['Price'].str.replace("$",'',regex=False)  # treat '$' literally, not as the end-of-string anchor
In [50]:
df['Price']=df['Price'].astype(float)
df['Price'].dtypes
Out[50]:
dtype('float64')
In [51]:
df['Price'].unique()
Out[51]:
array([  0.  ,   4.99,   6.99,   7.99,   3.99,   5.99,   2.99,   1.99,
         9.99,   0.99,   9.  ,   5.49,  10.  ,  24.99,  11.99,  79.99,
        16.99,  14.99,  29.99,  12.99,   3.49,  10.99,   7.49,   1.5 ,
        19.99,  15.99,  33.99,  39.99,   2.49,   4.49,   1.7 ,   1.49,
         3.88, 399.99,  17.99, 400.  ,   3.02,   1.76,   4.84,   4.77,
         1.61,   1.59,   6.49,   1.29, 299.99, 379.99,  37.99,  18.99,
       389.99,   8.49,   1.75,  14.  ,   2.  ,   3.08,   2.59,  19.4 ,
        15.46,   8.99,   3.04,  13.99,   4.29,   3.28,   4.6 ,   1.  ,
         2.9 ,   1.97,   2.56,   1.2 ])
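As a compact sketch of the price conversion on a few of the values from Out[48], the snippet below strips the dollar sign and parses the result in one expression with `pd.to_numeric`.

```python
import pandas as pd

prices = pd.Series(["0", "$4.99", "$399.99", "$1.20"])

# regex=False makes '$' a literal character instead of the end-of-string anchor.
prices_numeric = pd.to_numeric(prices.str.replace("$", "", regex=False))

print(prices_numeric.tolist())
```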

12. Drop the columns that you think are redundant for the analysis. (Suggestion: drop the column 'Rating', since we created a new feature from it (i.e. 'Rating_category'), and the columns 'App', 'Genres', 'Last Updated', 'Current Ver' and 'Android Ver', since they are redundant for our analysis.)

In [52]:
df.drop(columns=['Rating','Current Ver','Android Ver','Genres','Last Updated','App'],inplace=True)
In [53]:
df.columns
Out[53]:
Index(['Category', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
       'Content Rating', 'Rating_category'],
      dtype='object')
In [54]:
df
Out[54]:
Category Reviews Size Installs Type Price Content Rating Rating_category
0 ART_AND_DESIGN 5.075174 19000000.0 10000 Free 0.0 Everyone High
1 ART_AND_DESIGN 6.875232 14000000.0 500000 Free 0.0 Everyone High
2 ART_AND_DESIGN 11.379520 8700000.0 5000000 Free 0.0 Everyone High
3 ART_AND_DESIGN 12.281389 25000000.0 50000000 Free 0.0 Teen High
4 ART_AND_DESIGN 6.875232 2800000.0 100000 Free 0.0 Everyone High
... ... ... ... ... ... ... ... ...
10833 BOOKS_AND_REFERENCE 3.806662 619000.0 1000 Free 0.0 Everyone High
10834 FAMILY 2.079442 2600000.0 500 Free 0.0 Everyone High
10836 FAMILY 3.663562 53000000.0 5000 Free 0.0 Everyone High
10837 FAMILY 1.609438 3600000.0 100 Free 0.0 Everyone High
10840 LIFESTYLE 12.894981 19000000.0 10000000 Free 0.0 Everyone High

7424 rows × 8 columns

13. Encode the categorical columns.

In [55]:
col=df.select_dtypes(include='object').columns
for i in col:
    print(i," : ")
    print(df[i].unique())
Category  : 
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION']
Type  : 
['Free' 'Paid']
Content Rating  : 
['Everyone' 'Teen' 'Everyone 10+' 'Mature 17+' 'Adults only 18+' 'Unrated']
Rating_category  : 
['High' 'Low']
In [56]:
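from sklearn.preprocessing import LabelEncoder,OneHotEncoder
# Columns to integer-encode vs. columns to one-hot encode
le_col=['Category','Content Rating','Rating_category']
ohe_col=['Type']
le=LabelEncoder()
# The encoder is refit for each column; classes are sorted, so 'High' -> 0 and 'Low' -> 1
for i in le_col:
    df[i]=le.fit_transform(df[i])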
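The integer codes produced by LabelEncoder follow the sorted order of the class labels. The sketch below uses a toy list (not the real column) and the hypothetical names `labels`, `le_demo` and `mapping` to show how the label-to-code mapping can be read from `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels mirroring the two Rating_category values (illustrative only)
labels = ["High", "Low", "High", "High"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ is sorted, so a label's position in classes_ is its integer code
mapping = {cls: code for code, cls in enumerate(le_demo.classes_)}
print(mapping)         # {'High': 0, 'Low': 1}
print(codes.tolist())  # [0, 1, 0, 0]
```

Because a single encoder object is refit inside the loop, only the mapping of the last column survives on it. Keeping one encoder per column would be needed if `inverse_transform` is wanted later.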
In [57]:
# scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
ohe=OneHotEncoder(sparse_output=False)
ohe_df=pd.DataFrame(ohe.fit_transform(df[ohe_col]))
ohe_df
Out[57]:
0 1
0 1.0 0.0
1 1.0 0.0
2 1.0 0.0
3 1.0 0.0
4 1.0 0.0
... ... ...
7419 1.0 0.0
7420 1.0 0.0
7421 1.0 0.0
7422 1.0 0.0
7423 1.0 0.0

7424 rows × 2 columns

In [58]:
a=ohe.categories_
a
Out[58]:
[array(['Free', 'Paid'], dtype=object)]
In [59]:
ohe_df.columns=list(a[0])
In [60]:
ohe_df
Out[60]:
Free Paid
0 1.0 0.0
1 1.0 0.0
2 1.0 0.0
3 1.0 0.0
4 1.0 0.0
... ... ...
7419 1.0 0.0
7420 1.0 0.0
7421 1.0 0.0
7422 1.0 0.0
7423 1.0 0.0

7424 rows × 2 columns

In [61]:
df.reset_index(drop=True,inplace=True)
In [62]:
df_final=pd.concat((df,ohe_df),axis='columns',ignore_index=True)
In [63]:
df_final
Out[63]:
0 1 2 3 4 5 6 7 8 9
0 0 5.075174 19000000.0 10000 Free 0.0 1 0 1.0 0.0
1 0 6.875232 14000000.0 500000 Free 0.0 1 0 1.0 0.0
2 0 11.379520 8700000.0 5000000 Free 0.0 1 0 1.0 0.0
3 0 12.281389 25000000.0 50000000 Free 0.0 4 0 1.0 0.0
4 0 6.875232 2800000.0 100000 Free 0.0 1 0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ...
7419 3 3.806662 619000.0 1000 Free 0.0 1 0 1.0 0.0
7420 11 2.079442 2600000.0 500 Free 0.0 1 0 1.0 0.0
7421 11 3.663562 53000000.0 5000 Free 0.0 1 0 1.0 0.0
7422 11 1.609438 3600000.0 100 Free 0.0 1 0 1.0 0.0
7423 18 12.894981 19000000.0 10000000 Free 0.0 1 0 1.0 0.0

7424 rows × 10 columns

In [64]:
df_final.columns=list(df.columns)+list(ohe_df.columns)
In [65]:
df_final
Out[65]:
Category Reviews Size Installs Type Price Content Rating Rating_category Free Paid
0 0 5.075174 19000000.0 10000 Free 0.0 1 0 1.0 0.0
1 0 6.875232 14000000.0 500000 Free 0.0 1 0 1.0 0.0
2 0 11.379520 8700000.0 5000000 Free 0.0 1 0 1.0 0.0
3 0 12.281389 25000000.0 50000000 Free 0.0 4 0 1.0 0.0
4 0 6.875232 2800000.0 100000 Free 0.0 1 0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ...
7419 3 3.806662 619000.0 1000 Free 0.0 1 0 1.0 0.0
7420 11 2.079442 2600000.0 500 Free 0.0 1 0 1.0 0.0
7421 11 3.663562 53000000.0 5000 Free 0.0 1 0 1.0 0.0
7422 11 1.609438 3600000.0 100 Free 0.0 1 0 1.0 0.0
7423 18 12.894981 19000000.0 10000000 Free 0.0 1 0 1.0 0.0

7424 rows × 10 columns

In [66]:
df_final.drop(columns=ohe_col,inplace=True)
In [67]:
df_final.columns
Out[67]:
Index(['Category', 'Reviews', 'Size', 'Installs', 'Price', 'Content Rating',
       'Rating_category', 'Free', 'Paid'],
      dtype='object')
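The encode, concatenate, rename and drop steps above can also be done in one call with `pd.get_dummies`. This is a minimal sketch on a toy frame, not the notebook's method; `toy` and `encoded` are hypothetical names, and the indicator columns come out as `Type_Free` and `Type_Paid` rather than `Free` and `Paid`.

```python
import pandas as pd

# Toy frame standing in for df; only the 'Type' column is encoded
toy = pd.DataFrame({"Type": ["Free", "Paid", "Free"],
                    "Price": [0.0, 4.99, 0.0]})

# get_dummies drops 'Type' and appends one indicator column per level
encoded = pd.get_dummies(toy, columns=["Type"], dtype=float)
print(encoded.columns.tolist())  # ['Price', 'Type_Free', 'Type_Paid']
```

With only two levels the two indicator columns are exact complements of each other, so passing `drop_first=True` would keep a single column without losing information.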

14. Separate the target from the independent features (Hint: Use Rating_category as the target).

In [68]:
# Take both the target and the features from the encoded frame
y=df_final['Rating_category']
X=df_final.drop(columns='Rating_category')
In [69]:
X
Out[69]:
Category Reviews Size Installs Price Content Rating Free Paid
0 0 5.075174 19000000.0 10000 0.0 1 1.0 0.0
1 0 6.875232 14000000.0 500000 0.0 1 1.0 0.0
2 0 11.379520 8700000.0 5000000 0.0 1 1.0 0.0
3 0 12.281389 25000000.0 50000000 0.0 4 1.0 0.0
4 0 6.875232 2800000.0 100000 0.0 1 1.0 0.0
... ... ... ... ... ... ... ... ...
7419 3 3.806662 619000.0 1000 0.0 1 1.0 0.0
7420 11 2.079442 2600000.0 500 0.0 1 1.0 0.0
7421 11 3.663562 53000000.0 5000 0.0 1 1.0 0.0
7422 11 1.609438 3600000.0 100 0.0 1 1.0 0.0
7423 18 12.894981 19000000.0 10000000 0.0 1 1.0 0.0

7424 rows × 8 columns

In [70]:
y
Out[70]:
0       0
1       0
2       0
3       0
4       0
       ..
7419    0
7420    0
7421    0
7422    0
7423    0
Name: Rating_category, Length: 7424, dtype: int32

15. Split the dataset into train and test.

In [71]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=555)
X_train
Out[71]:
Category Reviews Size Installs Price Content Rating Free Paid
5331 12 8.364275 24000000.0 50000 0.0 1 1.0 0.0
7378 25 2.890372 2400000.0 1000 0.0 1 1.0 0.0
7369 19 5.814131 676000.0 10000 0.0 1 1.0 0.0
3663 26 9.915811 47000000.0 5000000 0.0 1 1.0 0.0
123 3 11.412763 5900000.0 5000000 0.0 1 1.0 0.0
... ... ... ... ... ... ... ... ...
2628 6 9.644717 2200000.0 1000000 0.0 1 1.0 0.0
1057 14 10.999680 7800000.0 5000000 0.0 1 1.0 0.0
7145 14 12.677123 27000000.0 50000000 0.0 4 1.0 0.0
4782 27 1.791759 1800000.0 5 0.0 1 1.0 0.0
6554 14 13.174189 49000000.0 10000000 0.0 2 1.0 0.0

5568 rows × 8 columns

In [72]:
X_test
Out[72]:
Category Reviews Size Installs Price Content Rating Free Paid
4515 11 7.547502 72000000.0 50000 0.0 1 1.0 0.0
7003 4 3.332205 18000000.0 5000 0.0 1 1.0 0.0
1767 26 10.477232 16000000.0 1000000 0.0 1 1.0 0.0
2169 25 8.701513 4300000.0 500000 0.0 1 1.0 0.0
3420 11 7.355641 35000000.0 50000 0.0 1 1.0 0.0
... ... ... ... ... ... ... ... ...
4666 11 6.242223 9800000.0 50000 0.0 1 1.0 0.0
3662 16 4.779123 72000000.0 50000 0.0 1 1.0 0.0
6416 11 8.187021 19000000.0 500000 0.0 1 1.0 0.0
3232 25 8.538563 1400000.0 500000 0.0 1 1.0 0.0
5169 21 2.079442 3400000.0 1000 0.0 2 1.0 0.0

1856 rows × 8 columns

In [73]:
y_train
Out[73]:
5331    0
7378    0
7369    0
3663    0
123     0
       ..
2628    0
1057    0
7145    0
4782    0
6554    0
Name: Rating_category, Length: 5568, dtype: int32
In [74]:
y_test
Out[74]:
4515    0
7003    0
1767    0
2169    0
3420    0
       ..
4666    0
3662    0
6416    0
3232    0
5169    0
Name: Rating_category, Length: 1856, dtype: int32
In [75]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape
Out[75]:
((5568, 8), (1856, 8), (5568,), (1856,))
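Every visible row of `y_train` and `y_test` is 0, which suggests (but does not prove) that the target may be imbalanced. If it is, a stratified split keeps the class proportions similar in both parts. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`); the 90/10 class ratio is an assumption for illustration, not a property of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 90 zeros and 10 ones (illustrative only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify=y_demo asks for the same class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=555, stratify=y_demo
)

# Share of the minority class in each part; both stay close to 0.10
print(y_tr.mean(), y_te.mean())
```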

16. Standardize the data, so that each feature has zero mean and unit variance.

In [76]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
# StandardScaler ignores y, so only X is passed; note this fits on the full X
scaler.fit_transform(X)
Out[76]:
array([[-2.03766618, -0.69434673, -0.15992777, ..., -0.46322046,
         0.28202925, -0.28202925],
       [-2.03766618, -0.20638643, -0.37330014, ..., -0.46322046,
         0.28202925, -0.28202925],
       [-2.03766618,  1.01463714, -0.59947486, ..., -0.46322046,
         0.28202925, -0.28202925],
       ...,
       [-0.68621673, -1.07700695,  1.29100439, ..., -0.46322046,
         0.28202925, -0.28202925],
       [-0.68621673, -1.63383939, -0.81711468, ..., -0.46322046,
         0.28202925, -0.28202925],
       [ 0.17379656,  1.42544875, -0.15992777, ..., -0.46322046,
         0.28202925, -0.28202925]])
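The cell above fits the scaler on the full `X`, so the test rows influence the mean and standard deviation. A common alternative is to fit on the training split only and reuse those statistics for the test split. This is a minimal sketch on small synthetic arrays; `X_train_demo`, `X_test_demo` and `scaler_demo` are hypothetical stand-ins for the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small synthetic arrays standing in for X_train / X_test
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler_demo = StandardScaler()
# Learn the mean and standard deviation from the training rows only
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
# Apply the training statistics to the test rows without refitting
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.mean(axis=0))  # approximately [0, 0]
print(X_train_scaled.std(axis=0))   # approximately [1, 1]
print(X_test_scaled)
```

Standardized values are not confined to a fixed interval: a test value far from the training mean, such as the 400.0 here, maps to a score well above 1.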